273 research outputs found

P-CNN: Pose-based CNN Features for Action Recognition

Authors: Chéron Guilhem; Laptev Ivan; Schmid Cordelia
Publication date: 23/09/2015

This work targets human action recognition in video. While recent methods typically represent actions by statistics of local video features, here we argue for the importance of a representation derived from human pose. To this end we propose a new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition. The descriptor aggregates motion and appearance information along tracks of human body parts. We investigate different schemes of temporal aggregation and experiment with P-CNN features obtained both for automatically estimated and manually annotated human poses. We evaluate our method on the recent and challenging JHMDB and MPII Cooking datasets. For both datasets our method shows consistent improvement over the state of the art. Comment: ICCV, December 2015, Santiago, Chile

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Antitrust Standing of Target Corporations to Enjoin Hostile Takeovers Under Section 16 of the Clayton Act

Authors: Laptev Ivan; Miech Antoine; Sivic Josef
Publication venue: FLASH: The Fordham Law Archive of Scholarship and History
Publication date: 01/01/1987

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim to learn text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Fordham University School of Law

Context-aware CNNs for person head detection

Authors: Laptev Ivan; Osokin Anton; Vu Tuan-Hung
Publication date: 24/11/2015

Person detection is a key problem for many computer vision tasks. While face detection has reached maturity, detecting people under a full variation of camera view-points, human poses, lighting conditions and occlusions is still a difficult challenge. In this work we focus on detecting human heads in natural scenes. Starting from the recent local R-CNN object detector, we extend it with two types of contextual cues. First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image. Second, we explicitly model pairwise relations among objects and train a Pairwise CNN model using a structured-output surrogate loss. The Local, Global and Pairwise models are combined into a joint CNN framework. To train and test our full model, we introduce a large dataset composed of 369,846 human heads annotated in 224,740 movie frames. We evaluate our method and demonstrate improvements in person head detection over several recent baselines on three datasets. We also show improvements in detection speed provided by our model. Comment: To appear in International Conference on Computer Vision (ICCV), 2015

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Occlusion resistant learning of intuitive physics from videos

Authors: Dupoux Emmanuel; Laptev Ivan; Riochet Ronan; Sivic Josef
Publication date: 30/04/2020

To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods have been proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We apply our method to a second dataset with increasing levels of occlusions, showing it realistically predicts segmentation masks up to 30 frames in the future. Finally, we also show results on predicting motion of objects in real videos.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Weakly-supervised learning of visual relations

Authors: Peyre Julia; Laptev Ivan; Schmid Cordelia; Sivic Josef
Publication date: 01/12/1979

This paper introduces a novel approach for modeling visual relations between pairs of objects. We define a relation as a triplet of the form (subject, predicate, object), where the predicate is typically a preposition (e.g. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge comes from the difficulty of obtaining annotations, especially at box-level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration for pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third, we introduce a new challenging dataset of unusual relations (UnRel) together with an exhaustive annotation that enables accurate evaluation of visual relation retrieval. We show experimentally that our model achieves state-of-the-art results on the visual relationship dataset, significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset.

arXiv.org e-Print Archive

Kansai Gaidai University Repository

Hal - Université Grenoble Alpes

Weakly-supervised learning of visual relations

Authors: Laptev Ivan; Peyre Julia; Schmid Cordelia; Sivic Josef
Publication date: 29/07/2017

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

On Pairwise Costs for Network Flow Multi-Object Tracking

Authors: Chari Visesh; Lacoste-Julien Simon; Laptev Ivan; Sivic Josef
Publication date: 05/05/2015

Multi-object tracking has recently been approached with min-cost network flow optimization techniques. Such methods simultaneously resolve multiple object tracks in a video and enable modeling of dependencies among tracks. Min-cost network flow methods also fit well within the "tracking-by-detection" paradigm where object trajectories are obtained by connecting per-frame outputs of an object detector. Object detectors, however, often fail due to occlusions and clutter in the video. To cope with such situations, we propose to add pairwise costs to the min-cost network flow framework. While integer solutions to such a problem become NP-hard, we design a convex relaxation solution with an efficient rounding heuristic which empirically gives certificates of small suboptimality. We evaluate two particular types of pairwise costs and demonstrate improvements over recent tracking methods in real-world video sequences.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Unsupervised Object Discovery and Tracking in Video Collections

Authors: Cho Minsu; Kwak Suha; Laptev Ivan; Ponce Jean; Schmid Cordelia
Publication date: 14/05/2015

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision. We formulate the problem as a combination of two complementary processes: discovery and tracking. The first one establishes correspondences between prominent regions across videos, and the second one associates successive similar object regions within the same video. Interestingly, our algorithm also discovers the implicit topology of frames associated with instances of the same object class across different videos, a role normally left to supervisory information in the form of class labels in conventional image and video understanding methods. Indeed, as demonstrated by our experiments, our method can handle video collections featuring multiple object classes, and substantially outperforms the state of the art in colocalization, even though it tackles a broader problem with much less supervision.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

포항공과대학교